This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated.
# Read CSV files <- Everything after a # is a comment and not evaluated
library(tidyverse) # Load a library that provides more functions
charts <- read_csv("charts_global_at.csv") # Read the data; '<-' is the assignment operator, notice that 'charts' now appears in the Environment pane on the right (RStudio's default layout)
## Rows: 292600 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): trackID, trackName, artistName, artistIds, region, isrc, primary_...
## dbl (19): rank, streams, dayNumber, explicit, trackPopularity, n_available_...
## date (2): day, releaseDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
charts # View some of the data
## # A tibble: 292,600 × 34
## trackID rank streams trackName artistName artistIds day dayNumber
## <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
## 1 003VDDA7J3… 108 982766 YELL OH Trippie R… 6Xgp2XMz… 2020-02-07 1129
## 2 003VDDA7J3… 168 747972 YELL OH Trippie R… 6Xgp2XMz… 2020-02-08 1130
## 3 003vvx7Niy… 197 4574 Mr. Brig… The Kille… 0C0XlULi… 2020-08-29 1333
## 4 003vvx7Niy… 194 4784 Mr. Brig… The Kille… 0C0XlULi… 2020-10-10 1375
## 5 003vvx7Niy… 194 4403 Mr. Brig… The Kille… 0C0XlULi… 2020-10-28 1393
## 6 003vvx7Niy… 193 4024 Mr. Brig… The Kille… 0C0XlULi… 2020-12-29 1455
## 7 003vvx7Niy… 193 4189 Mr. Brig… The Kille… 0C0XlULi… 2020-12-30 1456
## 8 003vvx7Niy… 158 7156 Mr. Brig… The Kille… 0C0XlULi… 2020-12-31 1457
## 9 003vvx7Niy… 158 4069 Mr. Brig… The Kille… 0C0XlULi… 2021-01-01 1458
## 10 003vvx7Niy… 195 650939 Mr. Brig… The Kille… 0C0XlULi… 2020-08-15 1319
## # … with 292,590 more rows, and 26 more variables: region <chr>, isrc <chr>,
## # explicit <dbl>, trackPopularity <dbl>, primary_artistName <chr>,
## # primary_artistID <chr>, artistIDs <chr>, albumName <chr>, albumID <chr>,
## # available_markets <chr>, n_available_markets <dbl>, releaseDate <date>,
## # releaseDate_precision <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>, …
R code is contained in so-called “chunks”. A chunk always starts
with three backticks followed by r in curly braces
({r}) and ends with three backticks. Optionally, parameters
can be added after the r to influence how the chunk behaves.
You can also give each chunk a name. Chunk names have
to be unique, otherwise R will refuse to knit your
document. A new code chunk can be added with the shortcut
Ctrl+Alt+i (Strg+Alt+i on a German
keyboard).
You can suppress messages and warnings by adding
message=FALSE, warning=FALSE to the chunk header, like
so:
```{r charts_no_messages, message=FALSE, warning=FALSE}
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts
```
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts
## # A tibble: 292,600 × 34
## trackID rank streams trackName artistName artistIds day dayNumber
## <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
## 1 003VDDA7J3… 108 982766 YELL OH Trippie R… 6Xgp2XMz… 2020-02-07 1129
## 2 003VDDA7J3… 168 747972 YELL OH Trippie R… 6Xgp2XMz… 2020-02-08 1130
## 3 003vvx7Niy… 197 4574 Mr. Brig… The Kille… 0C0XlULi… 2020-08-29 1333
## 4 003vvx7Niy… 194 4784 Mr. Brig… The Kille… 0C0XlULi… 2020-10-10 1375
## 5 003vvx7Niy… 194 4403 Mr. Brig… The Kille… 0C0XlULi… 2020-10-28 1393
## 6 003vvx7Niy… 193 4024 Mr. Brig… The Kille… 0C0XlULi… 2020-12-29 1455
## 7 003vvx7Niy… 193 4189 Mr. Brig… The Kille… 0C0XlULi… 2020-12-30 1456
## 8 003vvx7Niy… 158 7156 Mr. Brig… The Kille… 0C0XlULi… 2020-12-31 1457
## 9 003vvx7Niy… 158 4069 Mr. Brig… The Kille… 0C0XlULi… 2021-01-01 1458
## 10 003vvx7Niy… 195 650939 Mr. Brig… The Kille… 0C0XlULi… 2020-08-15 1319
## # … with 292,590 more rows, and 26 more variables: region <chr>, isrc <chr>,
## # explicit <dbl>, trackPopularity <dbl>, primary_artistName <chr>,
## # primary_artistID <chr>, artistIDs <chr>, albumName <chr>, albumID <chr>,
## # available_markets <chr>, n_available_markets <dbl>, releaseDate <date>,
## # releaseDate_precision <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>, …
In addition, you can even hide the code using echo=FALSE
(similar to slides):
```{r charts_no_code, message=FALSE, warning=FALSE, echo=FALSE}
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts
```
## # A tibble: 292,600 × 34
## trackID rank streams trackName artistName artistIds day dayNumber
## <chr> <dbl> <dbl> <chr> <chr> <chr> <date> <dbl>
## 1 003VDDA7J3… 108 982766 YELL OH Trippie R… 6Xgp2XMz… 2020-02-07 1129
## 2 003VDDA7J3… 168 747972 YELL OH Trippie R… 6Xgp2XMz… 2020-02-08 1130
## 3 003vvx7Niy… 197 4574 Mr. Brig… The Kille… 0C0XlULi… 2020-08-29 1333
## 4 003vvx7Niy… 194 4784 Mr. Brig… The Kille… 0C0XlULi… 2020-10-10 1375
## 5 003vvx7Niy… 194 4403 Mr. Brig… The Kille… 0C0XlULi… 2020-10-28 1393
## 6 003vvx7Niy… 193 4024 Mr. Brig… The Kille… 0C0XlULi… 2020-12-29 1455
## 7 003vvx7Niy… 193 4189 Mr. Brig… The Kille… 0C0XlULi… 2020-12-30 1456
## 8 003vvx7Niy… 158 7156 Mr. Brig… The Kille… 0C0XlULi… 2020-12-31 1457
## 9 003vvx7Niy… 158 4069 Mr. Brig… The Kille… 0C0XlULi… 2021-01-01 1458
## 10 003vvx7Niy… 195 650939 Mr. Brig… The Kille… 0C0XlULi… 2020-08-15 1319
## # … with 292,590 more rows, and 26 more variables: region <chr>, isrc <chr>,
## # explicit <dbl>, trackPopularity <dbl>, primary_artistName <chr>,
## # primary_artistID <chr>, artistIDs <chr>, albumName <chr>, albumID <chr>,
## # available_markets <chr>, n_available_markets <dbl>, releaseDate <date>,
## # releaseDate_precision <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>, …
All of these options can also be set globally for the whole document using the
knitr::opts_chunk$set(...) function, which is already
included in every new document, e.g.,
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
```
In this chunk we see the option include=FALSE, which
means “run the code but do not show the code or any output”. Perfect for
setting preferences the audience does not need to be aware of! Another
useful setting is scipen, an R option (set with options() rather than in
the chunk header) that controls when numbers are
printed in scientific notation, like so:
x <- 100000000000
x
## [1] 1e+11
Changing scipen to something large will tell R to print
zeros instead. Can you hide the following code chunk?
options(scipen = 99999)
x
## [1] 100000000000
We will see more chunk options related to visualizations later.
R Markdown combines R code and its output with text. You can use this, for example, for your thesis.
Usually you want to include some kind of heading to structure your
text. A heading is created using # signs. A single
# creates a first level heading, two ## a
second level and so on.
# First level heading
## Second level heading
##### Fifth level heading
It is important to note here that the # symbol means
something different within the code chunks as opposed to outside of
them. If you continue to put a # in front of all your
regular text, it will all be interpreted as a first level heading,
making your text very large.
Bullet point lists are created using *, +
or -. Sub-items are created by indenting the item using 4
spaces or 2 tabs.
* First Item
* Second Item
    + first sub-item
        - first sub-sub-item
    + second sub-item
Ordered lists can be created using numbers and letters. If you need
sub-sub-items use A) instead of A. on the
third level.
1. First item
    a. first sub-item
        A) first sub-sub-item
    b. second sub-item
2. Second item
Text can be formatted in italics (*italics*) or
bold (**bold**). In addition, you can add
block quotes with >
> Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.
library(skimr)
skim(charts)
| Name | charts |
| Number of rows | 292600 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 2 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| trackID | 0 | 1.00 | 22 | 22 | 0 | 5516 | 0 |
| trackName | 0 | 1.00 | 1 | 191 | 0 | 4619 | 0 |
| artistName | 0 | 1.00 | 2 | 329 | 0 | 2587 | 0 |
| artistIds | 0 | 1.00 | 22 | 436 | 0 | 2579 | 0 |
| region | 0 | 1.00 | 2 | 6 | 0 | 2 | 0 |
| isrc | 0 | 1.00 | 4 | 12 | 0 | 4713 | 0 |
| primary_artistName | 0 | 1.00 | 2 | 39 | 0 | 1131 | 0 |
| primary_artistID | 0 | 1.00 | 22 | 22 | 0 | 1127 | 0 |
| artistIDs | 0 | 1.00 | 22 | 436 | 0 | 2565 | 0 |
| albumName | 0 | 1.00 | 1 | 191 | 0 | 3178 | 0 |
| albumID | 0 | 1.00 | 22 | 22 | 0 | 3405 | 0 |
| available_markets | 10989 | 0.96 | 2 | 366 | 0 | 257 | 0 |
| releaseDate_precision | 0 | 1.00 | 3 | 5 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| day | 0 | 1 | 2019-01-01 | 2021-01-01 | 2020-01-02 | 732 |
| releaseDate | 0 | 1 | 1942-01-01 | 2021-01-01 | 2019-06-28 | 860 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rank | 0 | 1 | 100.50 | 57.73 | 1.00 | 50.75 | 100.50 | 150.25 | 200.00 | ▇▇▇▇▇ |
| streams | 0 | 1 | 635436.32 | 872930.82 | 2491.00 | 6386.00 | 552818.00 | 950352.00 | 17223237.00 | ▇▁▁▁▁ |
| dayNumber | 0 | 1 | 1092.54 | 211.38 | 727.00 | 909.00 | 1093.00 | 1276.00 | 1458.00 | ▇▇▇▇▇ |
| explicit | 0 | 1 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| trackPopularity | 0 | 1 | 75.95 | 15.08 | 0.00 | 69.00 | 80.00 | 86.00 | 100.00 | ▁▁▂▇▇ |
| n_available_markets | 0 | 1 | 71.80 | 24.36 | 0.00 | 78.00 | 79.00 | 79.00 | 92.00 | ▁▁▁▁▇ |
| danceability | 27 | 1 | 0.70 | 0.13 | 0.13 | 0.63 | 0.72 | 0.80 | 0.98 | ▁▁▃▇▃ |
| energy | 27 | 1 | 0.64 | 0.16 | 0.00 | 0.54 | 0.65 | 0.75 | 1.00 | ▁▁▅▇▂ |
| key | 27 | 1 | 5.51 | 3.59 | 0.00 | 2.00 | 6.00 | 8.00 | 11.00 | ▇▂▅▅▇ |
| loudness | 27 | 1 | -6.28 | 2.35 | -43.99 | -7.34 | -5.96 | -4.76 | 1.51 | ▁▁▁▂▇ |
| mode | 27 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| speechiness | 27 | 1 | 0.13 | 0.11 | 0.02 | 0.05 | 0.08 | 0.18 | 0.94 | ▇▂▁▁▁ |
| acousticness | 27 | 1 | 0.25 | 0.23 | 0.00 | 0.06 | 0.17 | 0.36 | 0.99 | ▇▃▂▁▁ |
| instrumentalness | 27 | 1 | 0.01 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | ▇▁▁▁▁ |
| liveness | 27 | 1 | 0.17 | 0.13 | 0.02 | 0.09 | 0.12 | 0.19 | 0.96 | ▇▂▁▁▁ |
| valence | 27 | 1 | 0.51 | 0.22 | 0.03 | 0.34 | 0.51 | 0.67 | 0.98 | ▂▆▇▆▃ |
| tempo | 27 | 1 | 120.58 | 28.54 | 45.78 | 97.06 | 119.10 | 139.96 | 216.33 | ▁▇▇▃▁ |
| duration_ms | 27 | 1 | 195400.99 | 38215.08 | 30133.00 | 170560.00 | 192172.00 | 214290.00 | 943529.00 | ▇▃▁▁▁ |
| time_signature | 27 | 1 | 3.98 | 0.30 | 1.00 | 4.00 | 4.00 | 4.00 | 5.00 | ▁▁▁▇▁ |
str(charts)
## spec_tbl_df [292,600 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trackID : chr [1:292600] "003VDDA7J3Xb2ZFlNx7nIZ" "003VDDA7J3Xb2ZFlNx7nIZ" "003vvx7Niy0yvhvHt4a68B" "003vvx7Niy0yvhvHt4a68B" ...
## $ rank : num [1:292600] 108 168 197 194 194 193 193 158 158 195 ...
## $ streams : num [1:292600] 982766 747972 4574 4784 4403 ...
## $ trackName : chr [1:292600] "YELL OH" "YELL OH" "Mr. Brightside" "Mr. Brightside" ...
## $ artistName : chr [1:292600] "Trippie Redd feat. Young Thug" "Trippie Redd feat. Young Thug" "The Killers" "The Killers" ...
## $ artistIds : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
## $ day : Date[1:292600], format: "2020-02-07" "2020-02-08" ...
## $ dayNumber : num [1:292600] 1129 1130 1333 1375 1393 ...
## $ region : chr [1:292600] "global" "global" "at" "at" ...
## $ isrc : chr [1:292600] "QZJ842000061" "QZJ842000061" "USIR20400274" "USIR20400274" ...
## $ explicit : num [1:292600] 1 1 0 0 0 0 0 0 0 0 ...
## $ trackPopularity : num [1:292600] 75 75 13 13 13 13 13 13 13 13 ...
## $ primary_artistName : chr [1:292600] "Trippie Redd" "Trippie Redd" "The Killers" "The Killers" ...
## $ primary_artistID : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax" "6Xgp2XMz1fhVYe7i6yNAax" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
## $ artistIDs : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
## $ albumName : chr [1:292600] "YELL OH" "YELL OH" "Hot Fuss" "Hot Fuss" ...
## $ albumID : chr [1:292600] "2orYogfKeURqyS1hRP1vZ4" "2orYogfKeURqyS1hRP1vZ4" "4piJq7R3gjUOxnYs6lDCTg" "4piJq7R3gjUOxnYs6lDCTg" ...
## $ available_markets : chr [1:292600] "AD, AE, AR, AT, AU, BE, BG, BH, BO, BR, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES, FI, FR, GB,"| __truncated__ "AD, AE, AR, AT, AU, BE, BG, BH, BO, BR, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES, FI, FR, GB,"| __truncated__ "AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, BR, BY, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES,"| __truncated__ "AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, BR, BY, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES,"| __truncated__ ...
## $ n_available_markets : num [1:292600] 79 79 92 92 92 92 92 92 92 92 ...
## $ releaseDate : Date[1:292600], format: "2020-02-07" "2020-02-07" ...
## $ releaseDate_precision: chr [1:292600] "day" "day" "year" "year" ...
## $ danceability : num [1:292600] 0.842 0.842 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 ...
## $ energy : num [1:292600] 0.578 0.578 0.911 0.911 0.911 0.911 0.911 0.911 0.911 0.911 ...
## $ key : num [1:292600] 6 6 1 1 1 1 1 1 1 1 ...
## $ loudness : num [1:292600] -6.05 -6.05 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 ...
## $ mode : num [1:292600] 0 0 1 1 1 1 1 1 1 1 ...
## $ speechiness : num [1:292600] 0.138 0.138 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 ...
## $ acousticness : num [1:292600] 0.0042 0.0042 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 ...
## $ instrumentalness : num [1:292600] 0 0 0 0 0 0 0 0 0 0 ...
## $ liveness : num [1:292600] 0.228 0.228 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 ...
## $ valence : num [1:292600] 0.19 0.19 0.236 0.236 0.236 0.236 0.236 0.236 0.236 0.236 ...
## $ tempo : num [1:292600] 74.5 74.5 148 148 148 ...
## $ duration_ms : num [1:292600] 236779 236779 222973 222973 222973 ...
## $ time_signature : num [1:292600] 4 4 4 4 4 4 4 4 4 4 ...
## - attr(*, "spec")=
## .. cols(
## .. trackID = col_character(),
## .. rank = col_double(),
## .. streams = col_double(),
## .. trackName = col_character(),
## .. artistName = col_character(),
## .. artistIds = col_character(),
## .. day = col_date(format = ""),
## .. dayNumber = col_double(),
## .. region = col_character(),
## .. isrc = col_character(),
## .. explicit = col_double(),
## .. trackPopularity = col_double(),
## .. primary_artistName = col_character(),
## .. primary_artistID = col_character(),
## .. artistIDs = col_character(),
## .. albumName = col_character(),
## .. albumID = col_character(),
## .. available_markets = col_character(),
## .. n_available_markets = col_double(),
## .. releaseDate = col_date(format = ""),
## .. releaseDate_precision = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double(),
## .. time_signature = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(charts)
When analyzing data in R, you will access most of the functionalities
by calling functions. A function is a piece of code written to carry out
a specified task (e.g., the skim(charts)-function to get an
overview of charts). It may or may not accept arguments or
parameters and it may or may not return one or more values. Functions
are generally called like this:
function_name(argument1 = value1, argument2 = value2)
Functions have a default order of arguments which allows us to omit the argument name and write
function_name(value1, value2)
if we know the correct order. The easiest way to learn about a
function is to look at the help file using ?function_name.
Try it out:
?skim
However, this will only work for loaded packages (after calling
library(skimr)). If you are not sure which (installed)
package provides a function try ??function_name, e.g.,
??nnet
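As a small illustration of named versus positional arguments, the following sketch uses base R's seq(), whose first three arguments are from, to and by. The choice of seq() is ours and not part of the original material; any function with several arguments would behave the same way.

```r
# Named arguments: the order in which they are written does not matter
by_name  <- seq(from = 1, to = 10, by = 3)
shuffled <- seq(by = 3, to = 10, from = 1)

# Positional arguments: values are matched in the default order (from, to, by)
by_position <- seq(1, 10, 3)

by_name                          # 1 4 7 10
identical(by_name, by_position)  # TRUE
identical(by_name, shuffled)     # TRUE
```

Naming the arguments is a little more typing but protects against accidentally swapping two values.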
Many packages also come with companion websites and so-called vignettes
(by the way, you can add hyperlinks to your R Markdown documents using
[text](https://my-link.html)).
You can also define your own functions to reuse some operations:
add_one <- function(x){
new_value <- x + 1 # intermediate variables in functions are not saved in your environment. Check!
return(new_value)
}
add_one(5)
## [1] 6
add_one(67)
## [1] 68
Of course we can do more interesting stuff like converting temperatures:
\[ ^{\circ}\mathbf{C} = (^{\circ}\mathbf{F} - 32) \times 5/9 \]
FtoC <- function(temperature_f){
return((temperature_f - 32) * 5/9)
}
FtoC(100)
## [1] 37.77778
FtoC(70)
## [1] 21.11111
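The conversion formula can be inverted to go from Celsius back to Fahrenheit, \( ^{\circ}\mathbf{F} = ^{\circ}\mathbf{C} \times 9/5 + 32 \). A minimal sketch, where CtoF is a hypothetical helper name that does not appear in the original material:

```r
# Fahrenheit to Celsius, as defined above
FtoC <- function(temperature_f){
  return((temperature_f - 32) * 5/9)
}

# Hypothetical inverse: Celsius to Fahrenheit
CtoF <- function(temperature_c){
  return(temperature_c * 9/5 + 32)
}

CtoF(100)        # boiling point of water: 212
CtoF(FtoC(70))   # round trip, gives back 70 up to floating point error
```

Checking that one function undoes the other is a quick way to catch a misplaced bracket in either formula.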
An example using ice cream sales data:
data("Icecream", package = "Ecdat")
skim(Icecream)
| Name | Icecream |
| Number of rows | 30 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cons | 0 | 1 | 0.36 | 0.07 | 0.26 | 0.31 | 0.35 | 0.39 | 0.55 | ▆▆▇▂▁ |
| income | 0 | 1 | 84.60 | 6.25 | 76.00 | 79.25 | 83.50 | 89.25 | 96.00 | ▇▆▃▂▃ |
| price | 0 | 1 | 0.28 | 0.01 | 0.26 | 0.27 | 0.28 | 0.28 | 0.29 | ▆▆▇▆▃ |
| temp | 0 | 1 | 49.10 | 16.42 | 24.00 | 32.25 | 49.50 | 63.75 | 72.00 | ▇▃▂▃▇ |
Icecream$temp_c <- FtoC(Icecream$temp) # Using the $ operator we can assign a new variable in an existing data.frame
skim(Icecream)
| Name | Icecream |
| Number of rows | 30 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| cons | 0 | 1 | 0.36 | 0.07 | 0.26 | 0.31 | 0.35 | 0.39 | 0.55 | ▆▆▇▂▁ |
| income | 0 | 1 | 84.60 | 6.25 | 76.00 | 79.25 | 83.50 | 89.25 | 96.00 | ▇▆▃▂▃ |
| price | 0 | 1 | 0.28 | 0.01 | 0.26 | 0.27 | 0.28 | 0.28 | 0.29 | ▆▆▇▆▃ |
| temp | 0 | 1 | 49.10 | 16.42 | 24.00 | 32.25 | 49.50 | 63.75 | 72.00 | ▇▃▂▃▇ |
| temp_c | 0 | 1 | 9.50 | 9.12 | -4.44 | 0.14 | 9.72 | 17.64 | 22.22 | ▇▃▂▃▇ |
ggplot(Icecream, aes(x = temp_c, y = cons)) + # cons -> consumption
geom_point()
The most important types of data are:
| Data type | Description |
|---|---|
| Numeric | Approximations of the real numbers, \(\normalsize\mathbb{R}\) (e.g., mileage a car gets: 23.6, 20.9, etc.) |
| Integer | Whole numbers, \(\normalsize\mathbb{Z}\) (e.g., number of sales: 7, 0, 120, 63, etc.) |
| Character | Text data (strings, e.g., product names) |
| Factor | Categorical data for classification (e.g., product groups) |
| Logical | TRUE, FALSE |
| Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.) |
Variables can be converted from one type to another using the
appropriate functions (e.g.,
as.numeric(), as.integer(), as.character(),
as.factor(), as.logical(),
as.Date()). For example, we could convert the object
y to character as follows:
y <- 5
print(y)
## [1] 5
y <- as.character(y)
print(y)
## [1] "5"
Notice how the value is in quotation marks since it is now of type character.
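A few more conversions, as a hedged sketch with made-up values (none of these objects are part of the charts data). Note that a conversion that is not possible yields NA with a warning rather than an error.

```r
as.numeric("23.6")      # character to numeric: 23.6
as.integer(7.9)         # numeric to integer truncates toward zero: 7
as.logical("TRUE")      # character to logical: TRUE
as.character(120)       # numeric to character: "120"

# Dates need a format if the text is not in year-month-day order
sale_date <- as.Date("21-06-2015", format = "%d-%m-%Y")
sale_date               # 2015-06-21

# Text that is not a number cannot be converted and becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)     # TRUE
```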
Entering a vector of data into R can be done with the
c(x1, x2, ..., x_n) (“concatenate”) command. In order to be
able to use our vector (or any other variable) later on, we
assign it a name using the assignment operator <-. You
can choose names arbitrarily (but the first character of a name cannot
be a number). Just make sure they are descriptive and unique. Assigning
the same name to two variables (e.g., vectors) will overwrite
the first.

Instead of converting a variable we can also create a new one
and use an existing one as input. In this case we omit the
as. and simply use the name of the type
(e.g., factor()). There is a subtle difference between the
two: when converting a variable with, e.g., as.factor(), we
can only pass the variable we want to convert without additional
arguments, and R determines the factor levels from the existing unique
values in the variable or just returns the variable itself if it is a
factor already. When we specifically create a variable (just
factor(), matrix(), etc.), we can and should
set the options of this type explicitly. For a factor variable these
could be the labels and levels, for a matrix the number of rows and
columns, and so on.
head(charts$explicit)
## [1] 1 1 0 0 0 0
charts$explicit <- factor(charts$explicit, levels = c(1,0), labels = c("Explicit", "Not Explicit"))
skim(charts, explicit)
| Name | charts |
| Number of rows | 292600 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| explicit | 0 | 1 | FALSE | 2 | Not: 181268, Exp: 111332 |
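The difference between converting with as.factor() and creating with factor() can also be seen on a small made-up vector. This is only a sketch; the ratings object is hypothetical and not part of the charts data.

```r
# Made-up ratings, not part of the charts data
ratings <- c("low", "high", "medium", "low")

# as.factor(): the levels are the sorted unique values
auto <- as.factor(ratings)
levels(auto)        # "high" "low" "medium"

# factor(): levels (and their order) and labels are set explicitly
manual <- factor(ratings,
                 levels = c("low", "medium", "high"),
                 labels = c("Low", "Medium", "High"))
levels(manual)      # "Low" "Medium" "High"

# The level order determines the underlying integer codes
as.integer(manual)  # 1 3 2 1
```

Setting the levels explicitly matters whenever the order carries meaning, for example in plot legends or summary tables.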
Now let’s create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use data frames.
vec <- c(1,2,3,4,5,6,7,8)
vec
## [1] 1 2 3 4 5 6 7 8
mat <- matrix(vec, ncol = 2)
mat
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
arr <- array(vec, c(2,2,2))
arr
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
li <- list(vector = vec, matrix = mat, array = arr)
li
## $vector
## [1] 1 2 3 4 5 6 7 8
##
## $matrix
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
##
## $array
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
data.frames are similar to matrices but are more
flexible in the sense that they may contain different data types (e.g.,
numeric, character, etc.), whereas all values of vectors and matrices have
to be of the same type (e.g., character). It is often more convenient to
use characters instead of numbers (e.g., when indicating a person's sex:
“F”, “M” instead of 1 for female, 2 for male). Thus we would like to
combine both numeric and character values while retaining the respective
desired features. This is where data.frames come into play:
they can hold a different type of data in each
column. data.frame() creates a separate column for each
vector, which is usually what we want (similar to SPSS or Excel).
df <- data.frame(vec = vec, vec_plus_one = add_one(vec), letters = c('a','b', 'c', 'd', 'e', 'f','g','h'))
df
# matrix will convert everything to characters
matrix(c(vec, add_one(vec), c('a','b', 'c', 'd', 'e', 'f','g','h')), ncol = 3)
## [,1] [,2] [,3]
## [1,] "1" "2" "a"
## [2,] "2" "3" "b"
## [3,] "3" "4" "c"
## [4,] "4" "5" "d"
## [5,] "5" "6" "e"
## [6,] "6" "7" "f"
## [7,] "7" "8" "g"
## [8,] "8" "9" "h"
Accessing elements of a data.frame:
# Single column
Icecream$temp_c
## [1] 5.0000000 13.3333333 17.2222222 20.0000000 20.5555556 18.3333333
## [7] 16.1111111 8.3333333 0.0000000 -4.4444444 -2.2222222 -3.3333333
## [13] 0.0000000 4.4444444 12.7777778 17.2222222 22.2222222 22.2222222
## [19] 19.4444444 15.5555556 6.6666667 4.4444444 0.0000000 -2.7777778
## [25] -2.2222222 0.5555556 5.0000000 11.1111111 17.7777778 21.6666667
# Multiple columns
Icecream[, c("temp_c", "temp")]
# First row
Icecream[1, ]
# First 5 rows
Icecream[1:5, ]
# Combination
Icecream[1:5, c("temp_c", "temp")]
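Rows can also be selected by a condition instead of by position. The following sketch uses a small made-up data.frame (the sales object and its values are hypothetical) so that it runs without the Icecream data.

```r
# Made-up data, only for illustration
sales <- data.frame(product = c("cone", "cup", "bar"),
                    price   = c(2.5, 3.0, 1.8))

# A single column, returned as a vector
sales$price

# Rows where a condition holds, all columns
sales[sales$price > 2, ]

# Rows by condition combined with a column by name
expensive <- sales[sales$price > 2, "product"]
expensive     # "cone" "cup"
```

The expression before the comma is a logical vector with one TRUE/FALSE per row, and only the TRUE rows are kept.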
In general, to read a data file, click on it in the RStudio file explorer; RStudio will show the appropriate import function, which you can copy into your script.
In some cases we want to download data from external sources (e.g., APIs). There are a couple of packages that can facilitate that.
Wikipedia includes many interesting and up-to-date tables. For example you might be looking for a suitable TikTok influencer for your products:
library(rvest)
library(janitor)
library(stringr)
most_followed_link <- 'https://en.wikipedia.org/wiki/List_of_most-followed_TikTok_accounts'
most_followed_page <- read_html(most_followed_link)
most_followed_tables <- html_nodes(most_followed_page, "table.wikitable")
most_followed <- most_followed_tables[[1]] %>% html_table(fill = TRUE)
most_followed
names(most_followed)
## [1] "Rank" "Username"
## [3] "Owner" "Followers[10](millions)"
## [5] "Likes[10](millions)" "Description"
## [7] "Country" "Brand Account"
most_followed <- clean_names(most_followed)
names(most_followed)
## [1] "rank" "username" "owner"
## [4] "followers_10_millions" "likes_10_millions" "description"
## [7] "country" "brand_account"
names(most_followed) <- str_remove(names(most_followed), "10_")
names(most_followed)
## [1] "rank" "username" "owner"
## [4] "followers_millions" "likes_millions" "description"
## [7] "country" "brand_account"
Reading data from websites can be tricky since you need to analyze the page structure first. Many web services (e.g., Facebook, Twitter, YouTube) actually have application programming interfaces (APIs), which you can use to obtain data in a pre-structured format. JSON (JavaScript Object Notation) is a popular lightweight data-interchange format in which data can be obtained. The process of obtaining data is visualized in the following graphic:
Obtaining data from APIs
The process of obtaining data from APIs consists of the following steps:

1. Identify an API that has enough data to be relevant and reliable (e.g., www.programmableweb.com has >12,000 open web APIs in 63 categories).
2. Request information by calling (or, more technically speaking, creating a request to) the API (e.g., from R, Python, PHP or JavaScript).
3. Receive the response messages, which are usually in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) format.
4. Write a parser to pull out the elements you want and put them into a simpler format.
5. Store, process or analyze the data according to the marketing research question.

Let’s assume that you would
like to obtain population data again. The World Bank has an API that
allows you to easily obtain this kind of data. The details are usually
provided in the API reference, e.g., here. You simply “call” the API for
the desired information and get a structured JSON file with the desired
key-value pairs in return. For example, the population for Austria from
1960 to 2019 can be obtained using this call. The file can be easily
read into R using the fromJSON() function from the
jsonlite package. Again, the result is a list and the second element
ctrydata[[2]] contains the desired data, from which we
select the “value” and “date” columns using the square brackets as usual:
[,c("value","date")]
library(jsonlite)
url <- "http://api.worldbank.org/v2/countries/AT/indicators/SP.POP.TOTL/?date=1960:2021&format=json&per_page=100" #specifies url
ctrydata <- fromJSON(url) #parses the data
str(ctrydata)
## List of 2
## $ :List of 7
## ..$ page : int 1
## ..$ pages : int 1
## ..$ per_page : int 100
## ..$ total : int 61
## ..$ sourceid : chr "2"
## ..$ sourcename : chr "World Development Indicators"
## ..$ lastupdated: chr "2022-02-15"
## $ :'data.frame': 61 obs. of 8 variables:
## ..$ indicator :'data.frame': 61 obs. of 2 variables:
## .. ..$ id : chr [1:61] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
## .. ..$ value: chr [1:61] "Population, total" "Population, total" "Population, total" "Population, total" ...
## ..$ country :'data.frame': 61 obs. of 2 variables:
## .. ..$ id : chr [1:61] "AT" "AT" "AT" "AT" ...
## .. ..$ value: chr [1:61] "Austria" "Austria" "Austria" "Austria" ...
## ..$ countryiso3code: chr [1:61] "AUT" "AUT" "AUT" "AUT" ...
## ..$ date : chr [1:61] "2020" "2019" "2018" "2017" ...
## ..$ value : int [1:61] 8917205 8879920 8840521 8797566 8736668 8642699 8546356 8479823 8429991 8391643 ...
## ..$ unit : chr [1:61] "" "" "" "" ...
## ..$ obs_status : chr [1:61] "" "" "" "" ...
## ..$ decimal : int [1:61] 0 0 0 0 0 0 0 0 0 0 ...
ctrydata[[2]][,c("value","date")]
ctrydata[[2]]$date <- as.numeric(ctrydata[[2]]$date)
ggplot(ctrydata[[2]], aes(x = date, y = value)) +
geom_line()
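The parsing step can also be tried without an internet connection. The sketch below passes a short JSON string to fromJSON() instead of a URL. The string is hand-written to imitate the two-part structure of the response above (metadata first, records second) and contains only two of the values shown in the str() output; it is not an actual API response.

```r
library(jsonlite)

# Hand-written text that imitates the structure of the response:
# first an object with metadata, then an array of records
response_text <- '[
  {"page": 1, "total": 2},
  [
    {"date": "2020", "value": 8917205},
    {"date": "2019", "value": 8879920}
  ]
]'

parsed <- fromJSON(response_text)

# The second element is simplified to a data.frame with one row per record
population <- parsed[[2]][, c("value", "date")]

# Dates arrive as text and are converted before plotting
population$date <- as.numeric(population$date)
population
```

Working on a small string like this makes it easier to see which part of the list holds the data before writing code against the live API.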
Try to recreate the following table for the “ease of doing business”
indicator (see function arrange)
doing_business_url <- "http://api.worldbank.org/v2/countries/all/indicators/IC.BUS.EASE.XQ/?date=2019&format=json&per_page=6000" #specifies url
#...
dplyr & tidyr

Both dplyr and tidyr are already included
in the tidyverse package so we don’t have to load anything else.
From dplyr we are going to use the following functions:

* select() picks variables based on their names. Reduces columns.
* filter() picks cases based on their values. Reduces rows by removing based on filtering function.
* mutate() adds new variables that are functions of existing variables. Adds column(s).
* summarize() reduces multiple values down to a single summary. Reduces rows by summarizing values.
* arrange() changes the ordering of the rows. Sorts data based on column(s).

These combine naturally with group_by() which allows you
to perform any operation “by group”.
For filtering we will need the following logical operations
Logical operations
| Operation | Description | Example | Result |
|---|---|---|---|
| a==b | a equal to b | 8/2==4 | TRUE |
| a!=b | a not equal to b | 8/2!=5 | TRUE |
| a>b | a greater than b | 2*2>3 | TRUE |
| a>=b | a greater than or equal to b | 5>=10/2 | TRUE |
| a<b | a less than b | 6/2 < 5 | TRUE |
| a<=b | a less than or equal to b | 5<=10/2 | TRUE |
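Logical AND: && e.g. 5>=4 && 7>5 \(\Rightarrow\) TRUE

Logical OR: || e.g. 5>=4 || 7>10 \(\Rightarrow\) TRUE

& and |: element-wise; && and ||: operate on single values only (older versions of R silently used only the first element of a longer vector; since R 4.3.0 this is an error)

select(charts, trackName, region, day)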
filter(charts, danceability > 0.96, region == "at", explicit == "Not Explicit")
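The element-wise behaviour of & and | is what filter() relies on: several conditions separated by commas are combined like &. A minimal sketch with made-up vectors (three hypothetical tracks, not taken from charts):

```r
# Made-up values for three tracks
danceability <- c(0.50, 0.97, 0.99)
region <- c("at", "global", "at")

# & compares element by element and returns one TRUE/FALSE per track
both <- danceability > 0.96 & region == "at"
both      # FALSE FALSE TRUE

# | is TRUE wherever at least one of the conditions holds
either <- danceability > 0.96 | region == "at"
either    # TRUE TRUE TRUE

# && and || are meant for single values, e.g. inside if()
single <- 5 >= 4 && 7 > 5
single    # TRUE
```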
# %>% inserts the previous output as the first argument
mutate(charts, log_streams = log(streams)) %>%
select(trackName, region, day, streams, log_streams)
group_by(charts, trackName) %>%
mutate(streams_std = scale(streams),
streams_mean = mean(streams),
streams_sd = sd(streams),
streams_std_manual = (streams - streams_mean)/streams_sd) %>%
select(trackName,streams_mean, streams_sd, streams_std, streams_std_manual, streams)
summarize(charts, streams=sum(streams))
group_by(charts, trackName) %>%
summarize(total_streams = sum(streams))
group_by(charts, artistName) %>%
summarize(total_streams = sum(streams)) %>%
arrange(desc(total_streams))
group_by(charts, trackName) %>%
filter(region == "global") %>%
summarize(days_in_charts = n(), total_streams = sum(streams), avg_rank = mean(rank)) %>%
filter(days_in_charts > 720) %>%
arrange(desc(days_in_charts))
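The last pipeline filters twice: once on the raw rows (before summarizing) and once on the summary (after summarizing). A small sketch on made-up data, where `toy` and the threshold of 2 are hypothetical, illustrates the difference.

```r
library(dplyr)

# Hypothetical toy data: one row per track, region and day
toy <- tibble(
  track   = c("a", "a", "a", "b", "b", "c"),
  region  = c("global", "global", "at", "global", "at", "global"),
  streams = c(5, 7, 1, 4, 2, 9)
)

result <- toy %>%
  filter(region == "global") %>%           # row filter: drop non-global rows
  group_by(track) %>%
  summarize(days_in_charts = n(),          # n() counts the rows per group
            total_streams  = sum(streams)) %>%
  filter(days_in_charts >= 2) %>%          # group filter: uses the summary
  arrange(desc(total_streams))

result
```

Only track "a" has two global rows, so it should be the only track left after the second filter.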
The tidyr package provides functions to “pivot” tables from long to wide and vice versa.
year_streams <- filter(charts, region == "global") %>%
group_by(year = format(day, "%Y"), trackName) %>%
summarize(streams = sum(streams))
year_streams
year_wide <- pivot_wider(year_streams, names_from = year, values_from = streams)
year_wide
filter(year_wide, if_all(`2019`:`2021`, ~ !is.na(.)))
filter(year_wide, !is.na(`2019`) & !is.na(`2020`) & !is.na(`2021`))
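To illustrate what pivot_wider() does, here is a minimal sketch on a made-up long table (`toy_long` is hypothetical). Each distinct value of `year` becomes a column, and combinations that do not occur in the long data are filled with `NA`. The last step keeps only tracks with a value in every year using if_all(), which is one way to express the row filter and assumes dplyr 1.0.4 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: track "c" has no entry for 2019
toy_long <- tibble(
  trackName = c("a", "a", "b", "b", "c"),
  year      = c("2019", "2020", "2019", "2020", "2020"),
  streams   = c(10, 20, 30, 40, 50)
)

# One column per year; missing combinations become NA
toy_wide <- pivot_wider(toy_long, names_from = year, values_from = streams)
toy_wide

# Keep only tracks that have a value in every year column
complete_tracks <- filter(toy_wide, if_all(`2019`:`2020`, ~ !is.na(.x)))
complete_tracks$trackName
```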
Usually the more useful function is pivot_longer because
most packages (e.g., ggplot2) expect long data.
pivot_longer(year_wide, `2019`:`2021`,
names_to = "year",
values_to = "streams",
values_drop_na = TRUE)
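Going from wide to long can be sketched in the same way. The example below builds a small made-up wide table (`toy_wide` is hypothetical) and shows the effect of `values_drop_na = TRUE`: rows that would only carry a missing value are left out of the long result.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: track "b" has no streams in 2019
toy_wide <- tibble(
  trackName = c("a", "b"),
  `2019`    = c(10, NA),
  `2020`    = c(20, 40)
)

toy_long <- pivot_longer(toy_wide, `2019`:`2020`,
                         names_to = "year",
                         values_to = "streams",
                         values_drop_na = TRUE)
toy_long
nrow(toy_long)
```

Note that the former column names end up as character values in `year`, so they may need converting (for example to a factor) before plotting.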
Let’s create a plot that shows the streams for the top 10 artists in the sample for 2019 and 2020. First we prepare the data:
top10artists <- filter(charts, format(day, '%Y') %in% c("2019", "2020")) %>%
group_by(artistName) %>%
summarize(total_streams = sum(streams)) %>%
top_n(n=10,total_streams)
top10artists
top10streams <- filter(charts, artistName %in% top10artists$artistName & format(day, '%Y') %in% c("2019", "2020")) %>%
group_by(artistName, year = format(day, "%Y")) %>%
summarise(streams = sum(streams))
## `summarise()` has grouped output by 'artistName'. You can override using the
## `.groups` argument.
top10streams$year <- factor(top10streams$year, levels = c("2019", "2020"))
top10streams
ggplot
The ggplot function prepares the “canvas” for the plot
by looking at the data we want to plot. The axes and coloring/fill color
can be passed to aes as variables.
ggplot(top10streams, aes(x = artistName, y = streams, fill = year))
Next we add (+) layers to the plot to show the data. To
see which artists are increasing their success, it’s better to draw the
two bars next to each other.
ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
geom_bar(stat = "identity", position = "dodge")
Alternatively, if we are more interested in total streams:
ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
geom_bar(stat = "identity", position = "stack")
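The two positions differ only in how bars that share an x value are arranged. A self-contained sketch with made-up numbers (`toy` is hypothetical) builds both variants from the same base plot so they can be compared.

```r
library(ggplot2)

# Hypothetical toy data: two artists, two years
toy <- data.frame(
  artist  = c("A", "A", "B", "B"),
  year    = factor(c("2019", "2020", "2019", "2020")),
  streams = c(10, 20, 30, 5)
)

base <- ggplot(toy, aes(x = artist, y = streams, fill = year))

# Side by side: compare the years within each artist
p_dodge <- base + geom_bar(stat = "identity", position = "dodge")

# On top of each other: the bar height is the total over both years
p_stack <- base + geom_bar(stat = "identity", position = "stack")

p_dodge
p_stack
```

In the stacked version the tallest bar reflects the sum over both years, while in the dodged version each bar shows a single year.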
We can fix the overlapping labels by adding another layer:
ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
Next we want to have an ordering to easily compare the success of the artists:
# arrange by total streams:
top10streams$artistName <- factor(top10streams$artistName,
levels = arrange(top10artists, desc(total_streams))$artistName)
ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
# arrange by 2020 streams
level_order <- filter(top10streams, year=="2020") %>% arrange(desc(streams))
top10streams$artistName <- factor(top10streams$artistName, levels = level_order$artistName)
plt_streams <- ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
plt_streams
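The reordering works because ggplot2 draws a discrete axis in the order of the factor levels, not in the order of the rows. A minimal sketch with made-up values (`toy` is hypothetical) shows how the levels change:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  artist  = c("B", "A", "C"),
  streams = c(20, 5, 50)
)

# Without explicit levels, factor() sorts the levels alphabetically
levels(factor(toy$artist))

# Order the levels by descending streams instead
level_order <- arrange(toy, desc(streams))$artist
toy$artist  <- factor(toy$artist, levels = level_order)
levels(toy$artist)
```

Only the levels change here; the rows of `toy` stay in their original order.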
Finally, let’s clean the plot up a bit. Notice that we can save plots
and add to them later! In addition, the width of the plot is increased
through the chunk option fig.width=12 here.
plt_streams +
ggtitle("Total streams of most successful artists", subtitle = "2019-2020") + # add title layer
theme_bw() +
theme(panel.border = element_blank(), # remove box around plot
axis.line = element_line(color = 'black'), # add x, y axes
panel.grid.major.x = element_blank(), # remove x grid lines
legend.title = element_blank(), # remove "year" from legend
axis.title.y = element_text(size = 16), # increase y title text size
axis.title.x = element_blank(), # remove x title
axis.text = element_text(size = 15), # increase text size of labels on both axes
legend.text = element_text(size = 15), # increase legend text size
title = element_text(size = 18)) + # increase title text size
scale_y_continuous(expand = expansion(mult = c(0, .1)), # remove spacing on bottom
labels = scales::comma) # add commas to the number of streams
charts %>%
filter(region == "global") %>% # Only global streams
group_by(day) %>% # We want to summarize per day
summarize(streams = sum(streams)) %>% # Calculate sum of streams
ggplot(aes(x = day, y = streams)) + # plot setup
geom_line() + # add lines
ggtitle("Total global streams of top 200 songs") # add title